Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research

نویسنده

Susanne Haaf

چکیده

This paper poses the question, how linguistic corpus-based research may be enriched by the exploitation of conceptual text structures and layout as provided via TEI annotation. Examples for possible areas of research and usage scenarios are provided based on the German historical corpus of the Deutsches Textarchiv (DTA) project, which has been consistently tagged accordant to the TEI Guidelines, more specifically to the DTA ›Base Format‹ (DTABf). The paper shows that by including TEI-XML structuring in corpus-based analyses significances can be observed for different linguistic phenomena, as e.g. the development of conceptual text structures themselves, the syntactic embedding of terms in certain conceptual text structures, and phenomena of language change which become obvious via the layout of a text. The exemplary study carried out here shows some of the potential for the exploitation of TEI annotation for linguistic research, which might be kept in mind when making design decisions for new corpora.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Encoding Biomedical Resources in TEI: The Case of the GENIA Corpus

It is well known that standardising the annotation of language resources significantly raises their potential, as it enables re-use and spurs the development of common technologies. Despite the fact that increasingly complex linguistic information is being added to biomedical texts, no standard solutions have so far been proposed for their encoding. This paper describes a standardised XML tagse...

متن کامل

Computer-Assisted Processing of Intertextuality in Ancient Languages

The production of digital critical editions of texts using TEI is now a widely-adopted procedure within digital humanities. The work described in this paper extends this approach to the publication of gnomologia (anthologies of wise sayings), which formed a widespread literary genre in many cultures of the medieval Mediterranean. These texts are challenging because they were rarely copied strai...

متن کامل

Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora

The Corpus Encoding Standard (CES) is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), conformant to the TEI Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994). It provides encoding conventions for linguistic corpora designed to be optimally suited for use in language engineer...

متن کامل

Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene

The paper describes a tool developed to process historical (Slovene) text, which annotates words in a TEI encoded corpus with their modern-day equivalents, morphosyntactic tags and lemmas. Such a tool is useful for developing historical corpora of highly-inflecting languages, enabling full text search in digital libraries of historical texts, for modernising such texts for today's readers and m...

متن کامل

The DTA 'base format': A TEI-subset for the compilation of interoperable corpora

This article describes a strict subset of TEI P5, the DTA ‘base format’, which combines the richness of encoding noncontroversial structural aspects of texts while allowing only minimal semantic interpretation. The proposed format is discussed with regard to other commonly used XML/TEI schemas. Furthermore, the article presents examples of good practices showing how external corpora can either ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research

نویسنده

چکیده

منابع مشابه

Encoding Biomedical Resources in TEI: The Case of the GENIA Corpus

Computer-Assisted Processing of Intertextuality in Ancient Languages

Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora

Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene

The DTA 'base format': A TEI-subset for the compilation of interoperable corpora

عنوان ژورنال:

اشتراک گذاری